Summarizing results

Elizabeth King
Kevin Middleton

Summarizing results

  • What to summarize?
  • How to summarize?
  • How to write about results?
  • What to show in figures?

Community standards: In progress

  • Much of this is still new in our research communities
  • Standards are evolving
  • Err on the side of too much information
    • Reviewers and editors can ask for less

What do I want to learn from the posterior?

  • Estimates
    • Means / medians / modes
    • HPDIs (how wide?)
  • Differences of estimates
    • Means / medians / modes
    • HPDIs (how wide?)
  • Hypothesis tests
    • ROPEs (what is the ROPE?)
    • Bayes factors
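Once sampling is done, each of these summaries is just a computation on the vector of posterior draws. A minimal base-R sketch, using simulated draws in place of a real posterior (the numbers are illustrative, not from any fitted model):

```r
# Simulated stand-in for a vector of posterior draws (illustration only)
set.seed(42)
draws <- rnorm(4000, mean = 2.9, sd = 0.4)

mean(draws)    # posterior mean
median(draws)  # posterior median

# Central (equal-tailed) 89% interval from quantiles; an HPDI would
# instead be the *narrowest* interval containing 89% of the draws
quantile(draws, probs = c(0.055, 0.945))
```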

What are your question(s)?

What you want to learn from the posterior is directly related to the questions you ask.

You planned these questions before designing the experiment and collecting the data.

What are your question(s)?

While accounting for block-to-block variation:

  • Is the low-predation treatment different from the control treatment?
  • Is the high-predation treatment different from the control treatment?
    • By some measure of central tendency (mean, median, mode)
    • By some measure of the difference (quantile interval, HPDI, ROPE)

How many samples is enough?

  • Means only: a few hundred
  • Middle 50% HPDI: on the order of 1,000
  • Wide intervals (e.g., 89%, 95%, 99%): on the order of 10,000
    • Sample for more iterations
  • Stan reports warnings when effective sample size in the tails is too low
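The intuition behind these rough numbers can be checked by simulation: the Monte Carlo error of a mean shrinks much faster than that of a far-tail quantile. A base-R sketch, with standard normal draws standing in for a posterior:

```r
# Variability of a summary statistic across repeated sets of n "draws"
set.seed(1)
mc_sd <- function(n, stat) sd(replicate(500, stat(rnorm(n))))

mc_sd(1000, mean)                             # mean: stable with few draws
mc_sd(1000, function(x) quantile(x, 0.995))   # far-tail quantile: noisy
mc_sd(10000, function(x) quantile(x, 0.995))  # needs ~10x the draws
```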

Sampling a model

  • brms default: 1000 warmup + 1000 sampling iterations per chain

# Half-normal prior (lower bound 0) on the treatment effects
priors <- prior(normal(0, 3), class = b, lb = 0)

# Cell-means model (no intercept) with a random intercept for block
fm <- brm(zooplankton ~ treatment - 1 + (1 | block),
          data = D,
          prior = priors,
          seed = 4547359,
          chains = 4, cores = 4,
          control = list(adapt_delta = 0.99))  # reduce divergences
summary(fm)

Sampling a model

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: zooplankton ~ treatment - 1 + (1 | block) 
   Data: D (Number of observations: 15) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~block (Number of levels: 5) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.50      0.38     0.03     1.50 1.02      523      774

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
treatmentcontrol     2.94      0.39     2.07     3.67 1.00      750      411
treatmentlow         1.94      0.37     1.07     2.62 1.01      679      393
treatmenthigh        1.32      0.37     0.40     1.99 1.01      636      338

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.54      0.15     0.33     0.92 1.01     1026     1776

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Increasing iterations

  • Get the model working well before increasing iterations
    • More iterations are not a fix for divergences
  • Sampling time scales approximately linearly with iterations
  • Warmup and sampling iterations can be set separately
    • Shorter (but long enough) warmup, longer sampling
  • Don’t worry about fine details in the extreme tails

Increasing iterations

fm <- brm(zooplankton ~ treatment - 1 + (1 | block),
          data = D,
          prior = prior(normal(0, 3), class = b, lb = 0),
          seed = 4547359,
          iter = 10000,
          chains = 4, cores = 4,
          control = list(adapt_delta = 0.99))
summary(fm)

Increasing iterations

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: zooplankton ~ treatment - 1 + (1 | block) 
   Data: D (Number of observations: 15) 
  Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
         total post-warmup draws = 20000

Group-Level Effects: 
~block (Number of levels: 5) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.50      0.38     0.04     1.45 1.00     3218     5474

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
treatmentcontrol     2.96      0.36     2.21     3.65 1.00     4439     3419
treatmentlow         1.94      0.36     1.17     2.65 1.00     4639     3375
treatmenthigh        1.33      0.36     0.56     2.02 1.00     3889     3181

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.55      0.16     0.33     0.93 1.00     5263     9319

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Drawbacks to increasing iterations

  • Slower sampling
    • Especially with many parameters and/or observations
    • Refine model using a subset of data
  • Large output files
    • All parameters sampled at every iteration

How large a posterior do I need?

Extracting the posterior

Note: several options here

post <- as_draws_df(fm)
str(post)
draws_df [20,000 × 15] (S3: draws_df/draws/tbl_df/tbl/data.frame)
 $ b_treatmentcontrol  : num [1:20000] 3.01 2.86 3.12 2.82 3.16 ...
 $ b_treatmentlow      : num [1:20000] 1.82 1.9 1.79 1.76 2.34 ...
 $ b_treatmenthigh     : num [1:20000] 1.33 1.3 1.39 1.37 1.75 ...
 $ sd_block__Intercept : num [1:20000] 0.517 0.407 0.917 0.43 1.107 ...
 $ sigma               : num [1:20000] 0.674 0.292 0.667 0.726 0.439 ...
 $ r_block[1,Intercept]: num [1:20000] 0.282 0.47 0.705 0.177 0.132 ...
 $ r_block[2,Intercept]: num [1:20000] 0.77 0.445 0.229 0.105 0.619 ...
 $ r_block[3,Intercept]: num [1:20000] -0.276 -0.101 -0.248 -0.291 -0.219 ...
 $ r_block[4,Intercept]: num [1:20000] -0.0279 -0.7576 0.2636 -0.4036 -0.4315 ...
 $ r_block[5,Intercept]: num [1:20000] 0.0205 0.2759 -0.1871 -0.0241 -0.3745 ...
 $ lprior              : num [1:20000] -7.28 -7.2 -7.38 -7.22 -7.6 ...
 $ lp__                : num [1:20000] -23.5 -22.1 -23.8 -23.8 -21.7 ...
 $ .chain              : int [1:20000] 1 1 1 1 1 1 1 1 1 1 ...
 $ .iteration          : int [1:20000] 1 2 3 4 5 6 7 8 9 10 ...
 $ .draw               : int [1:20000] 1 2 3 4 5 6 7 8 9 10 ...

Calculating contrasts

post <- post |> 
  select(starts_with("b_")) |> 
  mutate(low_v_control = b_treatmentlow - b_treatmentcontrol,
         high_v_control = b_treatmenthigh - b_treatmentcontrol)
head(post)
# A tibble: 6 × 5
  b_treatmentcontrol b_treatmentlow b_treatmenthigh low_v_control high_v_control
               <dbl>          <dbl>           <dbl>         <dbl>          <dbl>
1               3.01           1.82            1.33        -1.19           -1.68
2               2.86           1.90            1.30        -0.960          -1.56
3               3.12           1.79            1.39        -1.34           -1.74
4               2.82           1.76            1.37        -1.06           -1.46
5               3.16           2.34            1.75        -0.823          -1.42
6               3.07           1.87            1.18        -1.21           -1.89

Visualizing contrasts

post |> 
  pivot_longer(cols = ends_with("_control"),
               names_to = "Contrast",
               values_to = "Estimate") |> 
  ggplot(aes(Estimate, fill = Contrast)) +
  geom_density(alpha = 0.5) +
  labs(x = "Zooplankton Level\n(difference vs. control)") +
  theme(axis.title.y = element_blank())

Visualizing contrasts

Intervals to report

  • “Raw” parameter estimates (control, low, high)
    • Maybe / maybe not
    • You may have already reported group means
    • These are “block-aware” means from regularizing priors
  • Contrasts (i.e., hypothesis tests)
    • Other hypothesis tests and model comparisons (unit 5)

Parameter estimates

D |> group_by(treatment) |> summarise(Mean = mean(zooplankton))
# A tibble: 3 × 2
  treatment  Mean
  <fct>     <dbl>
1 control    3.02
2 low        2   
3 high       1.38
brms::fixef(fm)
                 Estimate Est.Error      Q2.5    Q97.5
treatmentcontrol 2.957024 0.3618012 2.2052729 3.652381
treatmentlow     1.943024 0.3626135 1.1723218 2.645919
treatmenthigh    1.328842 0.3609393 0.5576618 2.021403

Highest Density Intervals

post |> 
  summarize(across(.cols = everything(),
                   .fns = HDInterval::hdi,
                   credMass = 0.89))
# A tibble: 2 × 5
  b_treatmentcontrol b_treatmentlow b_treatmenthigh low_v_control high_v_control
               <dbl>          <dbl>           <dbl>         <dbl>          <dbl>
1               2.39           1.37           0.763        -1.60           -2.16
2               3.51           2.49           1.89         -0.477          -1.05
  • Rows are lower (5.5%) and upper (94.5%) bounds
  • Note: many options here
    • The coda package also has an HPDinterval() function (requires an mcmc-class object)
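Whichever package you use, the underlying computation is simple: find the narrowest window containing credMass of the sorted draws. A minimal base-R sketch of that idea for a unimodal posterior (hdi_basic is a hypothetical helper, not part of any package):

```r
# Narrowest interval containing credMass of the draws (unimodal case)
hdi_basic <- function(draws, credMass = 0.89) {
  x <- sort(draws)
  n <- length(x)
  k <- ceiling(credMass * n)            # number of draws inside the interval
  widths <- x[k:n] - x[1:(n - k + 1)]   # width of every candidate interval
  i <- which.min(widths)                # narrowest one wins
  c(lower = x[i], upper = x[i + k - 1])
}

set.seed(1)
hdi_basic(rnorm(4000, mean = 2, sd = 0.4))
```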

Writing about Bayesian models

  • Strive for reproducibility
  • Evolving standards
  • Software is part of the model fitting
    • a t-test is a t-test is a t-test (mostly)
    • Bayesian model fitting differs in important and not-so-important ways

Writing methods

  • Software, versions
  • Model statements
    • Describe the model fully, not just “Bayesian t-test”
  • Priors, prior predictive checks
  • Sampling (warmup, iterations)
  • Effective sample size, \(\widehat{R}\)
  • Handling of posteriors
  • Hypothesis tests

Model statement and priors

Writing methods

“Models were fit using the Stan programming language (Carpenter et al., 2017; Gelman, Lee, & Guo, 2015) via the rethinking package (McElreath, 2015; https://github.com/rmcelreath/rethinking) in R (ver. 3.6.2; R Development Core Team, 2013). Each model was run with four chains in parallel for 10,000 iterations, yielding at least 4,000 effective samples for the six parameters of interest. Adequate sampling was assessed visually via rank histograms and \(\widehat{R}\) values ≤ 1.01 (Vehtari, Gelman, Simpson, Carpenter, & Bürkner, 2019).”

Writing methods

“Models were estimated using Hamiltonian Monte Carlo via the stan statistical programming language (Stan Development Team 2019) with the Rstan package (Stan Development Team 2020) in R (R Core Team 2022). Models were sampled for 10,000 iterations with 50% warmup in four parallel chains. Starting values for parameters were drawn randomly from the priors separately for each chain. Model convergence was assessed by inspection of \(\widehat{R}\) values and rank histograms (Vehtari et al. 2019). After sampling, model parameter estimates had ~2,000-30,000 effective samples.”

Writing methods

“Models were estimated in R v. 4.2.0 (R Core Team, 2021) using the ‘brms’ package v. 2.17.1 (Bürkner, 2017, 2018), which is an interface for the Stan statistical programming language (Stan Development Team, 2019). Sampling was carried out using four chains in parallel with chains initialized by random draws from the priors. Chains were sampled for 10,000 iterations each with 50% warm-up using Hamiltonian Monte Carlo, which yielded effective sample sizes of ~8000-20,000 for main effects. Convergence was not problematic and was assessed by visual inspection of chains and \(\widehat{R}\) values near 1 in the posterior samples. \(\widehat{R}\) is a sampling diagnostic parameter that approaches 1 from above as the within- and between-chain estimates converge to the same distributions (Vehtari et al., 2019a).”

Writing results

  • Mirror the model(s)
  • Focus on parameter estimates and intervals

“Across all strains, mice with access to a running wheel were about 4 g lighter than the sedentary controls: MM = -3.9 g (95% HDI of Wheel vs Sedentary: -6.3 to -1.5 g), C3H/He = -4.0 g (95% HDI: -6.9 to -1.2 g), C57BL/6 = -4.5 g (95% HDI: -6.7 to -2.2 g).”

Figures 1: Prior predictive check

Figures 2: Comparison of model posteriors

Figures 3: Difference of posteriors

Figures 4: Many differences of posteriors